R Bootcamp day 1
What is R? A motivating example of data viz

Sarah Piombo
Natalia Zemlianskaia
George G. Vega Yon

August 16th, 2021

Overview

  1. What are R and RStudio?

  2. Getting help with R.

  3. A live example with ggplot2.

Part 1: What is R?

First questions

What is R?

R logo

R is a language and environment for statistical computing and graphics. — https://r-project.org

What is RStudio?

RStudio logo

RStudio is an integrated development environment (IDE) for R. — https://rstudio.org/products/rstudio

A nice way to see R vs RStudio by ModernDive (original tweet here)

R in the terminal

R + RStudio

Let’s see a live view of RStudio!…

Part 2: Hands on with ggplot2

All the code for this section can be downloaded here. The entire presentation (which contains the code) was generated using RMarkdown and can be downloaded from here.

(you will learn more about RMarkdown in day 3!)

Set-up: Loading R packages and Data

library(ggplot2)
data("diamonds")

To get help on a function, we can use the help("<FUNCTION NAME>") command in R. For example, if we wanted to learn more about library(), we could type

help("library")

Or, equivalently,

?"library"

(Let’s check out what the help file looks like!)
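Besides help() and ?, base R ships a few other lookup helpers. The sketch below uses only functions that come with R; nrow is just an illustrative target and any function name would do.

```r
# Peek at a function's arguments without opening its help page
args(nrow)

# Find objects whose names match a pattern (a regular expression)
apropos("^nrow$")

# Run the examples listed at the bottom of a help page
example("nrow")
```

These are handy when you remember only part of a function name or want to see a function in action before reading its full help file.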

Questions A:

  1. What other arguments does the function data() accept?

  2. What does the function str() do?

Looking at the Data

What does data look like in R? There are many ways to represent data in R. One of the most flexible (and popular) ways is through data frames (in the case of “base R”, the core component of R) and tibbles (in the case of the tidyverse). Tibbles and data frames share the same structure: a rectangular table with one row per observation and one column per variable.

For example, here is how R prints a tibble:

## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

And a data frame version of the same data:

##   carat       cut color clarity depth table price    x    y    z
## 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
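The slides show only the printed output, so the exact calls behind it are an assumption. A minimal sketch that should give printouts like the two above (the name diamonds_df is hypothetical, chosen for illustration):

```r
library(ggplot2)
data("diamonds")

# diamonds is stored as a tibble, so head() uses the tibble print layout
# (provided the tibble package is installed)
head(diamonds)

# Converting to a plain data frame gives the base R print layout
diamonds_df <- as.data.frame(diamonds)  # hypothetical name
head(diamonds_df)

# Both objects are data frames; only the tibble carries the extra classes
class(diamonds)
class(diamonds_df)
```

Note that the tibble printout also reports each column's type (<dbl>, <ord>, <int>), which the plain data frame printout omits.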

R has functions to query how many rows and columns these objects have. We can use the nrow and ncol functions as follows:

# How many rows and columns?
nrow(diamonds)
ncol(diamonds)
## [1] 53940
## [1] 10
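A few related base R helpers are worth knowing alongside nrow and ncol. The following is a small sketch that uses only base R functions on the same dataset:

```r
library(ggplot2)
data("diamonds")

# Rows and columns in a single call
dim(diamonds)

# Column (variable) names
names(diamonds)

# Quick numeric summary of one column, selected with $
summary(diamonds$price)
```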

Now let’s get our hands dirty and do some visualization!

A Walk Through Example with ggplot2

The ggplot2 R package is one of the most popular ways to build plots in R. Here we will be looking at a couple of examples using the diamonds dataset that we just loaded.

The overall structure of ggplot is as follows:

ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

Let’s see what happens if we run the following code:

ggplot(data = diamonds)

Nothing but an empty canvas! That is because we haven’t told ggplot what we want to visualize. The function only knows that we would like to work with the diamonds dataset, but it has no idea of what to plot!

Let’s try again using the following code:

ggplot(data = diamonds) +
  geom_point()
Error: geom_point requires the following missing aesthetics: x and y
Run `rlang::last_error()` to see where the error occurred.

Oops! We got an error, and the error says "geom_point requires the following missing aesthetics: x and y", which means that we still need to give ggplot a bit more information about what we would like to visualize. Asking for a scatter plot without indicating which variables to plot is meaningless.
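One detail worth noticing is that the error is raised when the plot is drawn, not when the plot object is created. The sketch below illustrates this by building the plot explicitly with ggplot_build() and capturing the message with tryCatch(); the names p and msg are hypothetical, and the exact wording of the message depends on the installed ggplot2 version.

```r
library(ggplot2)
data("diamonds")

# Creating the plot object succeeds even though x and y are missing
p <- ggplot(data = diamonds) + geom_point()

# The error only appears when ggplot2 builds the plot for drawing;
# tryCatch() captures the message instead of stopping the script
msg <- tryCatch(
  {
    ggplot_build(p)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg
```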

So let’s try one more time and see what we get!

ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price))
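The mapping can also be placed in ggplot() itself, in which case every geom added afterwards inherits it. A minimal sketch of this equivalent form (the name p is hypothetical):

```r
library(ggplot2)
data("diamonds")

# Mapping declared once in ggplot(); geom_point() inherits x and y
p <- ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_point()

# Printing the object draws the scatter plot
p
```

This form is convenient when several geoms share the same x and y, since the mapping does not have to be repeated in each one.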

How does the color affect the price?

ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price, color = color))

Now, how about clarity of the diamond?

ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price, color = color)) +
  facet_wrap(~clarity)
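Because each + returns a new plot object, a plot can be stored in a variable and extended step by step. The following sketch shows the idea; the names base_plot and faceted_plot are hypothetical, chosen for illustration.

```r
library(ggplot2)
data("diamonds")

# Store the scatter plot in a variable (nothing is drawn yet)
base_plot <- ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price, color = color))

# Printing the object draws it
base_plot

# Adding a component returns a new plot; base_plot itself is unchanged
faceted_plot <- base_plot + facet_wrap(~clarity)
faceted_plot
```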

Finally, let’s add some titles to make it look nicer:

ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price, color = color)) +
  facet_wrap(~clarity) + 
  labs(
    title    = "Price of Diamonds (by clarity)",
    subtitle = "data from the ggplot2 R package",
    x        = "Weight of the diamond (carat)",
    y        = "Price in US dollars",
    color    = "Color from \n J (worst) to D (best)"
    )

What else can we do?

ggwordcloud

Get it from CRAN here: https://cran.r-project.org/package=ggwordcloud

gganimate

Get it from CRAN here: https://cran.r-project.org/package=gganimate

ggridges

Get it from CRAN here: https://cran.r-project.org/package=ggridges

Questions B

  1. Reproduce the last plot, but this time put carat on the y-axis and price on the x-axis.

  2. Using the "mpg" dataset (which can be loaded using data(mpg)), draw a similar plot using the following mappings: aes(x = displ, y = hwy, color = drv). Fill in the missing pieces to get the plot:

data(< DATA >)
ggplot(data = < DATA >) + 
  geom_point(mapping = < MAPPINGS >) +
  labs(
    title    = "Fuel economy data",
    subtitle = "(1999 - 2008)",
    x        = "Engine displacement (liters)",
    y        = "Highway MPG",
    color    = "Drive train"
  )

Question B 1: Solution

ggplot(data = diamonds) +
  geom_point(mapping = aes(x = price, y = carat, color = color)) +
  facet_wrap(~clarity) + 
  labs(
    title    = "Price of Diamonds (by clarity)",
    subtitle = "data from the ggplot2 R package",
    y        = "Weight of the diamond (carat)",
    x        = "Price in US dollars",
    color    = "Color from \n J (worst) to D (best)"
    )

Question B 2: Solution

data(mpg)
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = drv)) +
  labs(
    title    = "Fuel economy data",
    subtitle = "(1999 - 2008)",
    x        = "Engine displacement (liters)",
    y        = "Highway MPG",
    color    = "Drive train"
  )

Bonus: An example using Boxplots

ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = clarity, y = price, fill = clarity)) 
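The middle line of each box marks the median price for that clarity level. As a cross-check, the same medians can be computed directly with base R; this is a small sketch, and the name median_price is hypothetical.

```r
library(ggplot2)
data("diamonds")

# Median price within each clarity level (the middle line of each box)
median_price <- tapply(diamonds$price, diamonds$clarity, median)
median_price
```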
